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Abstract 

Potential functions are critical for computational studies of protein structure prediction, folding, and sequence 
design. A class of widely used potentials for coarse grained models of proteins are contact potentials in the form 
of weighted linear sum of pairwise contacts. However, these potentials have been shown to be unsuitable choices 
because they cannot stabilize native proteins against a large number of decoys generated by gapless threading, 
when the number of native proteins is above 300. We develop an alternative framework for designing protein 
potential. We describe how finding optimal protein potential can be understood from two geometric viewpoints, 
and we derive nonlinear potentials using mixture of Gaussian kernel functions for folding and design. In our 
experiment we use a training set of 440 protein structures repre senting a major portion of all known protein 
structures, and about 14 million structure decoys and sequence decoys obtained by gapless threading. The 
optimization criterion for obtaining parameters of the potential is to minimize bounds on the generalization 
error of discriminating protein structures and decoys not used in training. We succeeded in obtaining nonlinear 
potential with perfect discrimination of the 440 native structures and native sequences. For the more challenging 
task of sequence design when decoys are obtained by gapless threading, we show that there is no linear potential 
with perfect discrimination of all 440 native sequences. Results on an independent test set of 194 proteins also 
showed that nonlinear kernel potential performs well, with only 3 structures and 14 sequences misclassified, which 
compare favorable with the results of 7 structures and 37 sequences misclassified using optimal linear potential. 
We conclude that more sophisticated formulation other than the simple weighted sum of contact pairs can be 
useful. 



Key words: Contact potential; nonlinear potential; kernel models; protein folding; protein design; optimization; 
support vector. 
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1 Introduction 



Potential function plays critical roles in computational studies of protein folding, protein structure prediction, 
and protein sequence design [1-4]. A variety of empirical potential functions have been developed for coarse- 
grained models of proteins, where amino acid residues are not represented at atomic details. One prominent class 
of potentials arc knowledge-based potentials derived from statistical analysis of database of protein structures [5- 
8] . In this class of potentials, the interaction between a pair of residues are estimated from its relative frequency 
in database when compared with a reference state or a null model. This approach has been successfully applied 
in fold recognition, in threading, and in many other studies [7 14]. The drawback of this class of potential is 
that there are several conceptual difficulties. These include the neglect of chain connectivity in the reference 
state, and the problematic implicit assumption of Boltzmann distribution [15-17]. An alternative approach is 
to find a set of parameters such that the potential fimctions arc optimized by some criterion, e.g., maximized 
energy difference between native conformation and a set of alternative (or decoy) conformations [16, 18-25]. This 
approach has been shown to be very effective in recognizing native structures from alternative conformations 
[25] and in folding membrane proteins [26]. However, if a large number of native protein structures are to be 
simultaneously stabilized against a large number of decoy conformations, no such potential functions can be 
found [20,22]. 

There are three key steps in developing effective empirical potential function using optimization: (1) the 
functional form of the potential, (2) the generation of a large set of decoys for discrimination, and (3) the opti- 
mization techniques. The initial step of choosing an appropriate functional form so far has been straightforward. 
Empirical pairwise potentials are usually all in the form of weighted linear sum of interacting residue pairs (see 
reference [27] for an exception). In this functional form, the weight coefficients are the parameters of the po- 
tential, which are optimized for discrimination. The same functional form is also used in statistical potential, 
where the weight coefficients are derived from database statistics. For the task of decoy generation, an efficient 
method to obtain millions of decoys is gapless threading, i.e., a decoy conformation can be obtained by mount- 
ing the sequence of the native protein to different parts of the conformation of a larger protein of an unrelated 
structure [19]. Alternatively, a large number of challenging decoys can be generated by using either a chain 
growth method [28,29] or a protein structure refinement method [30], both are technically more complex. The 
optimization techniques that have been used include perceptron learning and linear programming [20,22], and 
analytical solution has also been obtained when assumptions about the distribution of data are made [31]. The 
objectives of optimization are often maximization of energy gap between native conformation and the average 
of decoy conformation, or energy gap between native and decoys with lowest energy, or the ^-score of the native 
structure [18,32-35]. 

In this study, we develop an alternative formulation of protein potential function in the form of mixture 
of nonlinear Gaussian kernel functions. We also use a different optimization technique based on quadratic 
programming. Instead of maximizing the energy gap, here a function related to bounds of expected classification 
errors is optimized [36-39]. In addition, we explore the relationship between protein folding and protein design. 
We study a simplified version of the protein folding problem. Our goal is to identify the native protein structure 
from an ensemble of alternative or decoy structures for a given amino acid sequence. We also study a simplified 
version of the protein design problem. Our goal of protein sequence design is to identify a protein sequence that 
is most compatible with a given three-dimensional coarse-grained structure. In this study, we do not address the 
problem of how to generate candidate conformation or candidate sequence by searching either the conformation 
space or the sequence space. 

Experimentation with the nonlinear function developed shows that it can discriminate simultaneous 440 native 
proteins against 14 million structure decoys generated by gapless threading. Results of similar experiments with 
perceptron learning was negative, as reported in literature [22]. We also test our potential function for protein 
design. Using the same training set of 440 proteins and a corresponding set of 14 millions sequence decoys, we 
succeed in developing a nonlinear function that correctly classifies all 440 native sequences. In contrast, we cannot 
obtain a weighted linear sum potential using the state-of-the-art interior point solver of linear programming 
method as reported in [20,40], such that it is capable of classifying perfectly all the native sequences. We also 
perform blind tests for both native structure and native sequence recognition. Taking 194 proteins unrelated 
to the 440 training set proteins, the nonlinear potential achieves a success rate of 92.8% and 98.4% in sequence 
design and in structure recognition, respectively. Both results compare favorably with optimal linear potential 
and statistical potential. 
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The rest of the paper is organized as follows. We first describe theory and model of linear and nonlinear 
function, including the kernel model and the optimization technique. We then explain details of computation. 
We further describe experimental results of learning and results of blind test. We conclude with discussion. 

2 Theory and Models 

Modeling Protein Folding Potential. To model protein computationally, we first need a method to 
describe its geometric shape and its sequence of amino acid residues. Frequently, a protein is represented by a 
d-dimensional vector c G M"^. For example, we can represent a protein as a vector c G M*^, d = 20, by measuring 
the solvent accessible surface areas of each of the 20 residue types. A method that is widely used is to count 
nonbonded contacts of various types of amino acid residue pairs in a protein structure. In this case, the count 
vector ceIR'',d = 210, is used as the protein descriptor. Once the structural conformation of a protein s and its 
amino acid sequence a is given, the protein description / : (s,a) i— *• M.'^ will fully determine the d-dimensional 
vector c. In the case of contact vector, / corresponds to the mapping provided by specific contact definition, 
e.g., two residues are in contact if their distance is below a specific cut-off threshold distance. 

Based on the classical experiments of Anfinsen [41] , a fundamental requirement for protein folding is that the 
native structure sn with native amino acid sequence ajv must have an energy iJ(/(sjVj <ijv)) that is the lowest 
among a set of alternative structure called decoys T> = oat)}, where the native amino acid sequence 
takes a decoy conformation sd that is different from the native conformation s;v [19,42]: 

H{f{sN, aw)) < H{f{sD, ajv)) for all (sd, ajv) e T> 

Sometimes we can further require that the energy difference must be greater than a constant b> 0: 

H{f{sN, On)) + b< H{f{sD, ajv)) for all {sd, on) G V 

A widely used functional form for protein potential function H is the weighted linear sum of pairwise contacts 
[5-8, 20, 21]. The linear sum energy H is then: 

H{f{s,a)) = H{c) = wc. (1) 

As soon as the weight vector w is specified, the potential function is fully defined. For such linear potentials, 
the basic requirement for protein potential is then: 

w ■ {cn - cd) < 

or 

w ■ {cn — Cd) + b < 

if we require that the energy difference between a native structure and a decoy must be greater than a real value 
b. 

Two Geometric Views of Linear Protein Folding Potentials. There is a natural geometric view of 
the inequality requirement for weighted linear sum potentials. A useful observation is that each of the inequalities 

divides the space of R'' into two halfs separated by a hyperplane. The hyperplane is defined by the normal vector 
(cjv — Cd) and its distance 6/||cjv — Cd\\ from the origin. The weight vector w must be located in the half- 
space opposite to the direction of the normal vector (cjv — cd) (Fig la). This half-space can be written as 
w • {cn - Cd) + b < 0. 

When there are many inequalities to be satisfied simultaneously, the intersection of the half-spaces forms 
a convex polyhedron [43]. If the weight vector is located in the polyhedron, all the inequalities are satisfied. 
Potentials with such weight vector w can discriminate the native protein from the set of all decoys. 

For each native protein i, there is one convex polyhedron Vi formed by the set of inequalities associated 
with its decoys. If our potential can discriminate simultaneously n structures from a union of sets of decoys, 
the weight vector w must be located in a smaller convex polyhedron V that is the intersection of the n convex 
polyhedra: 

n 

w€P=f]Vi. 

i=l 
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Figure 1: Geometric views of the inequality requirement for weighted linear protein potential, (a). In the first geometric 
view, the space M*^ is divided into two half-spaces by the hyperplane w ■ {cn — cd) + 6 < 0. The hyperplane is defined 
by the normal vector (cat — cd) and its distance h/Wc^ — c/)|| from the origin. The weight vector must be located 
in the half-space opposite to the direction of the normal vector (cjv — cd). (b). A second geometric view of the 
inequality requirement for linear protein potential. The space M'^ is divided into two half-spaces by the hyperplane 
w ■ {cn — Cd) + 6 < 0. Here the hyperplane is defined by the normal vector w and its distance VI 1""^! I f^o"^ the origin. 
All points {cat — cd} are located on one side of the hyperplane away from the origin. 



There is yet anotiier geometric view of the inequality requirements. The relationship w ■ (cjv — Cd) + b <0 
for all decoys and native protein structures can be regarded as a requirement that all points {cn — c^} are 
located on one side of a hyperplane, which is defined by its normal vector w and its distance fe/llit^H to the origin 
(Fig lb). We can show that such a hyperplane exists if the origin is not contained within the convex hull of the 
set of points {c^ — cn} (see Appendix). 

The second geometric view is dual and mathematically equivalent to the first geometric view. In the first 
view, a point cn — cn determined by the structure-decoy pair cn = {sn, cn) and cjj = {sn, cin) corresponds to 
a hyperplane representing an inequality, a solution weight vector w corresponds to a point located in the final 
convex polyhedron. In the second view, each structure-decoy pair is represented as a point Cat — Cd in R'', and 
the solution weight vector w is represented by a hyperplane separating all the points C = {cn — cd} from the 
origin. 



Optimal Linear Potentials. Several optimization methods have been applied to find the weight vector 
w. The Rosenblantt perceptron method works by iteratively updating an initial weight vector wq [21,25]. 
Starting with a random vector, e.g., wq = 0, we test each native protein and its decoy structure. Whenever the 
relationship w ■ {c^ — cd) + b < is violated, we update w by adding to it a scaled vector rj ■ {c^ — cd). The 
final weight vector is therefore a linear combination of protein and decoy count vectors: 

w = ^7] ■ {cn - Cd) = ^ aN ■ Cn - ^ an ■ Cd. (2) 

NeJ^ DeT> 

Here Af is the set of native proteins, and T> is the set of decoys. The set of coefiicients {aff} U {ao} gives a dual 
form representation of the weight vector w as an expansion of the training examples, including both native and 
decoy structures. 

If the final convex polyhedron V is non-empty, there can be infinite number of choices of w, all with perfect 
discrimination. But how do we find a weight vector w that is optimal? This depends on the criterion for 
optimality. For example, one can choose the weight vector w that minimizes the variance of energy gaps 

as used in reference 
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between decoys and natives: arg^ min (w • (cjv — cd)) — X^d ('"' ' i^N — cd)) 



[20] , or minimizing the Z-score of a large set of native proteins, or minimizing the Z-score of the native protein 
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and an ensemble of decoys [35, 44], or maximizing the ratio R between the width of the distribution of the energy 
and the average energy difference between the native state and the unfolded ones [18,45]. A series of important 
works using perceptron learning and other optimization techniques [18, 20, 21, 24, 46] showed that effective linear 
sum potentials can be obtained. 

Here we describe yet another optimality criterion according to the second geometric view. We can choose 
the hyperplane {w, b) that separates the points {cjv — cd} with the largest distance to the origin. Intuitively, we 
want to characterize proteins with a region defined by the training set points {cat — c_d}. It is desirable to define 
this region such that a new unseen point drawn from the same protein distribution as {cn — Cd} will have a high 
probability to fall within the defined region, and non-protein points following a different distribution, which is 
assumed to be centered around the origin when no a priori information is available, will have a high probability 
to fall outside the defined region. In this case, we are more interested in modeling the region or support of the 
distribution of protein data, rather than estimating its density distribution function. For linear potential, regions 
are half-spaces defined by hyperplanes, and the optimal hyperplane {w,b) is the one with maximal distance to 
the origin. This is related to the novelty detection problem and single-class support vector machine studied in 
statistical learning theory [36,39,47]. In our case, any non-protein points will need to be detected as outliers 
from the protein distribution characterized by {cn — Cd}- Among all linear functions derived from the same set 
of native proteins and decoys, the optimal weight vector w is likely to have the least amount of mislabellings. 
The optimal weight vector w can therefore be found by solving the following primal quadratic programming 
problem: 

Minimize 

subject to w ■ (cat - cd) + 6 < for all e A/" and D eV. 

The solution maximizes the distance &/| jii'l | of the plane {w, b) to the origin. The dual form of the same quadratic 
programming problem can be written as [48] : 

Minimize 

subject to aj > and a, = 1 

where Xi,Xj e {(cjv — cd)}- 

Nonlinear Potential. However, it is possible that no such weight vector w exists, i.e., the final convex 
polyhedron V = 0"=! ^« ™^^y empty set. First, for a specific native protein i, there may be severe 

restriction from some inequality constraints, which makes Vi an empty set. Some decoys are very difficult to 
discriminate due to perhaps deficiency in protein representation. In these cases, it is impossible to adjust the 
weight vector so the native structure has a lower energy than the decoy. Second, even if a weight vector w 
can be found for each native protein, i.e., w is contained in a nonempty polyhedron, it is still possible that 
the intersection of n polyhedra is an empty set, i.e., no weight vector can be found that can stabilize all native 
proteins against the decoys simultaneously. Computationally, the question whether a solution weight vector 
w exists can be answered unambiguously in polynomial time [49], and recent studies using millions of decoys 
strongly suggest that when the number of native protein structure reaches 300-400, no such weight vector can 
be found [20,22]. When perfect discrimination is impossible, a technique that minimizes the percentage of 
unsatisfied inequalities or the error rate was developed in reference [31]. 

A fundamental reason for this failure is that the functional form of linear sum of pairwise interaction is too 
simplistic. It has been suggested that higher order interactions such as three-body or four-body contacts should 
be incorporated [50-52] . Functions with polynomial terms using upto 6 degree of Chebyshev expansion has also 
been used to represent pairwise interactions [27]. 

Here we propose an alternative approach. At this time we still limit ourselves to pairwise contact interactions. 
We introduce a nonlinear potential function analogous to the dual form of the linear function in Equation (2), 
which takes the following form: 

H{f{s, a)) = H{c) = ^ • K{c, cd) - J2 aN ■ K{c, cn) (3) 

where au > and aAr > are parameters of the potential function to be determined, and cd is a contact vector 
of decoy D in the set of decoys V = {{st>, o,j\r)}, cn a contact vector of native structure N in the set of native 



5 



training proteins M = {(sjv, a^v)}- The difference of this functional form from linear function in Equation (2) is 
that a kernel function K(x^ y) replaces the linear term. A convenient kernel function K is: 

i^(x,y) = e-ll^-yll'/2-^ 

Intuitively, the potential surface has smooth Gaussian hills of height au centered on the location cd of decoy 
structure D, and has smooth Gaussian cones of depth centered on the location cat of native structures A''. 
Ideally, the value of the potential function will be —1 for contact vectors cjv of native proteins, and will be +1 
for contact vectors cd of decoys. 

Optimal Nonlinear Potential. To obtain the nonlinear potential, our goal is to find a set of parameters 

Qjv} such that H{ f{sN, Qat)) has energy value close to —1 for native proteins, and the decoys have energy 
values close to +1. There arc many different choices of {a£),ajv}- We use an optimality criterion originally 
developed in statistical learning theory [37-39]. First, we note that we have implicitly mapped each structure 
and decoy from ]R^^° through the kernel function of K{x,y) = e^H^^yH /^'^ to another space with dimension 
as high as tens of millions. Second, we then find the hyperplane of the largest margin distance separating 
proteins and decoys in the space transformed by the nonlinear kernel. That is, we search for a hyperplane with 
equal and maximal distance to the closest native proteins and the closest decoys. Such a hyperplane can be 
found by obtaining the parameters {a/j} and {ajv} from solving the following Lagrange dual form of quadratic 
programming problem: 

Maximize J2ieMuT> - 5 J2i,jGAruv V^Vj ' ' e-llc--<=^H'/2<^' 
subject to < ai < C 



where C is a regularizing constant that limits the influence of each misclassified conformation [36-39,47], and 
y, = +1 if i is a native protein, and = — 1 if i is a decoy. These parameters lead to optimal classification of 
unseen test set proteins against decoys [36-39,47]. 

Modeling Sequence Design Potential. For protein sequence design, we assume that native sequence 

is more stable on the native conformation than on a different structure of another protein with low sequence 
identity. We use the method of gapless sequence threading to generate sequence decoys by mounting a sequence 
fragment from a different protein of a larger size of an unrelated structure to the native structure [19]. Because 
all native contacts are retained, we find that such sequence decoys are quite challenging. 

For protein design, we seek potential functions that allows the search and identification of sequences most 
compatible with a specific given coarse-grain three-dimensional structure. We use a model analogous to the 
Anfisen experiments. We require that the native amino acid sequence ajv mounted on the native structure s^r 
has the lowest energy compared to a set of unrelated alternative sequences V = {sn, cld} mounted on the same 
native protein structure sjv: 

H(f{sN,aN)) < H{f{sN, ao)) for aU ao e V 

Equivalently, the native sequence will have the highest probability to fit into the specified native structure. This 
is the same principle described in [53-55]. Much work has been done using linear design function of sum of 
contact pairs in the form of H{ f{s, a)) = H{c) = w ■ c [53, 54]. The discussion of the two geometric views of the 
weight vector of linear potential and the optimality criterion also applies to the sequence design problem. 
We now explore nonlinear potential function for sequence design using the following functional form: 

H{f{s, a)) = H{c) = ^ aD • K{c, cd) - ^ ajv • K{c, cn) (4) 

where an > and aN > are coefficients to be determined, and cd — f{sN, do) is the contact vector of a decoy 
sequence Ud mounted on its native protein structure s^r, and cn = f{s]\[,a]s[) is the contact vector of a native 
sequence ajv from the set of native training proteins TV mounted on the native structure sjv- The only difference 
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from nonlinear folding potential of Equation (3) is that here V is a. set of sequence decoys mounted on native 
protein structures, rather than a set of structure decoys. Again, we use kernel function K{x, y) — e^^^^~y^^ /^'^ . 

The optimal parameters {ajv} and {«£)} are obtained similarly by solving the following Lagrange dual convex 
quadratic programming problem: 

Maximize X] ~ ^ H ' "iaje-H'^^-'^^H'/^'^' 

subject to < ttj < C 

where C is a regularizing constant [37, 38], and yi = +1 if i is a native protein, and y, = — 1 if i is a decoy. 

3 Computational Methods 

Alpha Contact Maps. Because protein molecules are formed by thousands of atoms, their shapes are 
complex. In this study we use the count vector of pairwise contact interactions derived from the edge simplices of 
the alpha shape of a protein structure. Edge simplices in the alpha shape represent nearest neighbor interactions 
that are in physical contacts. They encode precisely the same contact information as a subset of the edges in 
the Voronoi diagram of the protein molecule. These Voronoi edges are shared by two interacting atoms from 
different residues, but intersect with the body of the molecule modeled as the union of atom balls. We refer to 
references [56, 57] for further theoretical and computational details. 

Generating Structure Decoys and Sequence Decoys by Threading. Maiorov and Crippen used 

the gaplcss threading method for generating a large number of structure decoys [19]. In this method, the 
sequence of a smaller protein oat is threaded through the structure of an unrelated larger protein and takes the 
conformation sd of a fragment of the larger protein [19]. Along the way, the sequence of the smaller protein 
can take the conformations of many fragments of the larger protein, each provides a structure decoy. With this 
approach, we generate for each native protein {(sjv,ajv)} a set of structure decoys {{sd,cin)} (Fig 2a). 

We can generate sequence decoys in an analogous way, as already suggested in [42,51]. In this case, we 
thread the sequence of a larger protein through the structure of a smaller protein, and obtain sequence decoys 
by mounting a fragment of the native sequence from the large protein to the full structure of the small protein. 
We therefore have for each native protein {sn, ajv) a set of sequence decoys {sn, an) (Fig 2b). 

Protein Data. Following reference [22] , we use protein structures contained in the Whatif database [58] in 
this study. Whatif database contains a representative set of sequence-unique protein structures generated from 
X-ray crystallography. Structures selected for this study all have pairwise sequence identity < 30%, R-factor 
< 0.21, and resolution < 2.1. Whatif database contains less structures than Pdbselect because the R-factor 
and resolution criteria are more stringent [58]. Nevertheless, it provides a good representative set of currently 
all known protein structures. 

We use a list of 456 proteins compiled from the 1998 release (Whatif98) of the Whatif database [22], 
which was kindly provided by Dr. Vendruscolo. There are 192 proteins with multiple chains in this dataset. 
Some of them have extensive interchain contacts. For these proteins, it is possible that their conformations 
may be different if there are no interchain contacts present. We use the criterion of Contact Ratio to remove 
proteins that have extensive interchain contacts. Contact Ratio is defined here as the number of interchain 
contacts divided by the total number of contacts a chain makes. For example, protein lept has four chains 
A,B,C, and D. The intra chain contact number of chain B is 397. Contacts between chain A and chain B is 178, 
between B and C is 220, between B and other heteroatoms is 11. The Contact Ratio of chain B is therefore 
(178 -h 220 + 11)/(397 + 178 + 220 + 11) = 51%. Thirteen protein chains are removed because they all have 
Contact Ratio > 30%. We further remove three proteins because each has > 10% of residues missing with no 
coordinates in the Protein Data Bank file. The remaining set of 440 proteins are then used as training set for 
developing both folding and design potential functions. Using the sequence and structure threading method 
described earlier, we generated a set of 14 080 766 sequence decoys and a set of structure decoys of the same 
size. 



7 





B 



Figure 2: Decoy generation by gapless threading, (a). Structure decoys can be generated by threading the sequence of 
a smaller protein to the structure of an unrelated larger protein, (b). Sequence decoys can be generated by threading 
the sequence of a larger protein to the structure of an unrelated smaller protein. 

Learning Linear PotentiaL For comparison, we develop optimal linear potential following the method 
and computational procedure described in reference [20]. We apply the interior point method as implemented in 
BPMD by Meszaros [59] to search for a weight vector w. We use two different optimization criteria as described 
in reference [20]. The first is: 

Identify w 
subject to w ■ (cjv — cd) < e and \wi\ < 10 



where Wi denotes the i-th. component of weight vector w, and e = 1 x 10 ^. Let C ~ {cn — cd}, and |C| the 
number of decoys. The second optimization criterion is: 



Minimize min J2 i'^ ' i'^N - CD)f - |^ " (cw - c_d)) 



subject to 



w 



\c\ 



(cat - Cd) < e 
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Learning Nonlinear Kernel Potential. We use SVMlight (http://svmlight.joachims.org/) [60] 
with Gaussian kernels and a training set of 440 native proteins plus 14 080 766 decoys to obtain the optimized 
parameter {aNjOio}- The rcgularization constant C takes default value, which is estimated from the training 
set U 2? as implemented in SVMlight: 



Since we cannot load all 14 millions decoys into computer memory simultaneously, we use three heuristic 
strategics for training. In the first method, the total 14 080 766 decoys are divided randomly into 34 subsets. The 
training process starts with all native proteins and the first decoy set. After the training process converges, we 
select the decoy structures that have aj ^ (these decoys are called support vectors [37-39]). Prom the second 
iteration on, we combined the native proteins, the new decoy set, and all decoys that appeared as support vectors 
in any of the previous training steps. These form the new training set for the next iteration of learning. This 
process is continued until all decoys have been exhausted in training. 

In the second method, each of the 34 decoy subsets was combined with the native set in turn to form a 
training set. The decoy support vectors are then selected from each of the 34 training processes. All these decoy 
support vectors are combined, along with native proteins for a final training process. The results are taken as 
the optimized protein potential. 

The third method is similar to the procedure reported in [20] . We first randomly selected a subset of decoys 
that fits into the computer memory. For example, we pick every 51st decoy from the list of 14 million decoys. 
This leads to an initial training set of 276 095 decoys and 440 native proteins. An initial protein potential is then 
obtained after learning. Next the energies for all 14 million decoys and all 440 native proteins are evaluated. 
Three decoy sets were collected based on the evaluation results: the first set of decoys contains the violating 
decoys which have lower energy than the native structures; the second set contains decoys with the lowest 
absolute energy, and the third set contains decoy support vectors identified in previous training process. The 
union of these three subsets of decoys are then combined with the 440 native protein as the training set for the 
next iteration of learning. This process is repeated until the energy difference to native protein for all decoys 
are greater than 0.0. Using this strategy, the number of iterations typically is between 2 and 10. During the 
training process, we set the cost factor j in SVMlight to 120, which is the factor by which training errors on 
native proteins outweighs errors on decoys. 

The value of for the Gaussian kernel K{x, y) = e~^^^~y^^ /'^'^ is chosen by experimentation. We find that 
if the value of cr^ is too large, no {ajv,Q!D} can be found that can perfectly classifies the 440 training proteins 
and their decoys, i.e., the problem is unlearnable. If the value of cr^ is too small, the performance in blind-test 
will deteriorate because of overfitting. For the 440 native proteins and the 14 080 766 sequence decoys, the 
value of cr^ is chosen between 250.0 (for training method 1) and 138.9 (for training method 3). The final folding 
potential is obtained with = 227.3 and the final design potential is obtained with cr^ = 138.9, both derived 
using the third training method. 



Linear Folding Potentials from Structure Decoys. Using perceptron learning and a large number 
of decoys generated by gapless threading, Vendruscolo et al showed that if the number of native proteins exceeds 

270, it is impossible to find parameters w for potential function H{s, a) = w ■ c, such that all native structures 
have lower energies than decoys [22]. That is, no w can be found such that w ■ Cn < w ■ Cd holds for all decoys. 

Tobi et al showed that pairwise contact potential H{s, a) = w ■ c can be found to distinguish a different set 
of 572 proteins from a set of 28 213 009 structure decoys generated by gapless threading [20]. In this study, two 
residues are defined to be in contact if the geometric centers of their side chains are at a distance between 2.0 
A and 6.4 A. To search for the optimal weight vector w, the authors used linear programming solver based on 
interior point method as implemented in BPMD by Meszaros [59]. 

We succeeded in reproducing the results of Tobi et al [20] and found a w vector that enables perfect discrim- 
ination of the same set of 572 proteins used in [20] against 28 261 307 structure decoys we generated by gapless 
threading. We used the same contact definition based on Euclidean distance between geometric centers of side 




(5) 



4 Results 
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chains, as well as the same optimization criterion described earlier in the Methods section. The values of the 
210 pairwise contact potentials (data not shown) are similar although not identical to the values listed in [20]. 

Linear Design Potentials from Sequence Decoys. Structure decoys generated by gaplcss threading 
are obtained by taking a fragment of the structure of a large protein such that it contains exactly the same 
number of amino acid residues as the native protein. However, contacts in the resulting fragment structure often 
contain only a fraction of the total contacts in native protein. 
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Figure 3: Structure decoys generated by threading have many contacts lost. There is 185 residues in protein 1531, and 
305 residues in labe. There are 121 decoys when the sequence of 1531 is threaded on the segments of the structure of 
labe. Between 1/3 - 1/4 of contacts are lost in these decoys. Structure decoys generated by threading are therefore 
not very challenging. 

Design decoys are generated by taking a fragment sequence of a larger protein and thread it onto the native 
conformation of a smaller protein. As a result, although the contact count vector c may be very different, the 
number of contacts in the design decoy is exactly the same as that in native protein. For decoys generated by 
gapless threading, design decoys therefore arc far more challenge to discriminate then structure decoys (Fig 3). 

Our experimentation confirms this observation. After generating 14 080 766 sequence design decoys for the 
440 proteins in the training set, we apply the same interior point method to search for an optimal w that can 
discriminate native sequences from decoy sequences. That is, we search for parameters w for H{s, a) — w ■ c, 
such that w ■ Cm < w • Cd for all sequences. However, we fail to find a feasible solution for the weight vector 
w. This indicates that no w exists capable of discriminating perfectly 440 native sequences from the 14 million 
decoy sequences. We repeated the same experiment using the set of 572 native proteins from reference [20] and 
28 261 307 sequence decoys. The result is also negative. 



Learning Nonlinear Kernel potential. To overcome the problems associated with linear potentials, 
we use the same set of 440 native proteins and 14 million decoys to obtain nonlinear kernel folding and design 
potentials. In both cases, wc succeeded in finding a function in the form of Equation (3) that can discriminate 
all 440 native proteins from 14 million decoys. 

Unlike statistical potentials where each native protein structure in the database contribute to the empirical 
potential, only a subset of native structures contribute and have ajv 7^ 0. In addition, a small fraction of decoys 
also contribute to the potential function. Table 1 list the details of the potential, including the numbers of native 
proteins and decoys that participate in Equation (3). These number represent about 60% of native proteins and 
< 0.1% of decoys from the original training data. 



Discrimination Tests for Folding Potential. Blind test in discriminating native proteins from decoys 

using an independent test set is essential to assess the effectiveness of folding potentials. To construct a test set, 
we first take the entries in Whatif99 database that are not present in Whatif98. After eliminating proteins 
with chain length less than 46 residues, we obtain a set of 201 proteins. These proteins all have < 30% sequence 
identities with any other sequence in either the training set or the test set proteins. Since 139 of the 201 test 
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Design Potential 


Folding Potential 






Strategy 1 


Strategy 2 


Strategy 3 








0-2 = 250.0 


0-2 = 227.3 


0-2 = 138.9 


0-2 = 227.3 


Num. of 


Natives 


260 


258 


347 


268 


Support Vectors 


Decoys 


2457 


2877 


4709 


2560 


Range of 


Natives 


1.070 ~ 0.9996 


0.9993 ~ 5.762 


0.9991 ~ 3.314 


0.9990 ~ 4.485 


Energy Values 


Decoys 


-9.495 ~ 0.7979 


-9.485 - 0.9339 


-6.655 - 0.9882 


-8.379 - 1.039 


Range of Smallest Energy Gap 


0.2025 - 14.56 


0.06624 - 14.55 


0.01226 - 9.237 


0.0610 - 11.36 



Tabic 1: Details of learning nonlinear kernel folding and design potentials. The number of native proteins and decoys 
with non-zero entering the potential function is listed. These native proteins and decoys are called support vectors. 
The range of the energy values of natives and decoys are also listed, as well as the range of the energy gaps between 
the native protein and the decoy with the lowest energy. 

proteins have multiple chains, we use the same criteria applied in training set selection to exclude 7 proteins 
with > 30% Contact Ratio or with > 10% residues missing coordinates in the PDB files. This leaves a smaller 
set of test proteins of 194 proteins. Using gapless threading, we generate a sets of 3 096 019 structure decoys 
from the set of 201 proteins. This is a superset of the decoy set generated using 194 proteins. 

For comparison, we also test the discrimination results of the optimal linear potential taken as reported in 
reference [20] , as well as the statistical potential developed by Miyazawa and Jernigan. Here we use the contact 
definition reported in [20], that is, two residues are declared to be in contact if the geometric centers of their 
side chains are within a distance of 2.0 - 6.4 A. 





Misclassified Natives 


Misclassified Natives 


Kernel Folding Potential 


3/194 


8/201 


Tobi & Elber*^ 


7/194 


13/201 


Miyazawa & Jernigan 


85/194 


92/201 


Kernel Design Potential 


7/194 


13/201 



Table 2: The number of misclassified protein structures using nonlinear kernel folding potential, optimal linear potential 
taken as reported in [20], and Miyazawa-Jernigan statistical potential [9] among the test set of 194 proteins and the set 
of 201 proteins. The latter include proteins with more than 30% interchain contacts and proteins with > 10% missing 
coordinates. We also list performance of kernel design potential for structure recognition. 

To test nonlinear folding potential functions for discriminating native proteins from structure decoys in both 
the 194 and the 201 test sets, we take the structure s from the conformation-sequence pair (s, ajv) with the lowest 
energy as the predicted structure of the native sequence. If it is not the native structure sjy, the discrimination 
failed and the folding potential does not work for this protein. The results of discriminating native structures 
using nonlinear folding potential are summarized in Table 2. There are 3 and 8 misclassified native structures 
for the 194 set and 201 set, respectively. These correspond to a failure rate of 1.5% and 4.0%, respectively. The 
optimal nonlinear kernel folding potential performs better than the optimal linear potential based on calculation 
using potential values taken as reported in reference [20] (failure rates 3.6% and 6.5% for the 194 set and 201 
set, respectively). Consistent with previous reports [61], statistical potential has about 43.8% (81 out of 194) 
and 43.2% (87 out of 201) failure rates for the 194 set and the 201 set, respectively. 

Discrimination Tests for Design Potential. We use the same 194 set and 201 set of natives proteins 
and generate a set of 3 096 019 sequence decoys for testing the design potential. We take the sequence a from 
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Misclassified Natives 


Misclassified Natives 


Kernel Design Potential* 


14/194 


20/201 


Tobi & Elber-^ 


37/194 


44/201 


Miyazawa &; Jernigan 


81/194 


87/201 


Kernel Folding Potential 


24/194 


30/201 



Table 3: The number of misclassified protein sequences using nonlinear kernel design potential, optimal linear potential 
taken as reported in [20], and Miyazawa-Jernigan statistical potential [9] among the set of 194 proteins and the set of 
201 proteins. The latter include proteins with more than 30% interchain contacts and proteins with > 10% missing 
coordinates. The nonlinear kernel design potential is the only function that succeeded in perfect discrimination of the 
440 native sequences from a set of 14 million sequence decoys. 



the conformation-sequence pair (sN.a) for a protein with the lowest energy as the predicted sequence. If it is 
not the native sequence a^, the discrimination failed and the design potential does not work for this protein. 

Sequence decoys obtained by gapless threading are quite challenging, since all native contacts of the protein 
structures are maintained. No linear design potential function can be found using linear programming method 
that would succeed in the challenging task of identifying all 440 native sequences in the training set. 

We succeeded in obtaining nonlinear design potential capable of discriminating all of the 440 native sequences 
(Table 4). It also works well in the test set. It succeeded in correctly identifying 92.8% (180 out of 194) of native 
sequences in the independent test set. This compares favorably with results obtained using optimal linear folding 
potential taken as reported in [20], which succeeded in identifying 80.9% (157 out of 194) of the test set, and the 
Miyazawa-Jernigan statistical potential (success rate 58.2%, 113 out of 194). 

Discrimination Test Using Challenging Decoys. For the protein potentials derived with simple 

decoys generated by gapless threading, a more challenging test is to discriminate native proteins from an ensemble 
of explicitly generated three dimensional decoy structures with a significant number of near-native conformations 
[7,62]. Here we evaluate the performance of nonlinear folding and design potential using three decoy sets from 
the database "Decoys 'R' Us" [63]: the 4state_reduced set, the Lattice_SSFIT set, and the lmsd set. We 
compare our results in performance with results reported in literature using optimal linear potential [64] and 
statistical potential [9] (Table 4). For the 4state_reduced set of decoys, nonlinear design potential has the best 
performance in terms of identifying the native structure. The only misclassified protein lsn3 has three disulfide 
bonds, which are not modeled explicitly in the protein description of a vector in R^^". The correlation of root 
mean square distance of conformations to the native structure and energy value in the 4STATE set are shown in 
Fig 4. For proteins lr69, 2cro, 3icb, and Ictf, the correlation coefficient between the rmsd values and the 
energy values are good. We note that the nonlinear potential is obtained by learning from training decoys that 
are obtained by gapless threading. It is likely that nonlinear kernel potential can be further improved if more 
realistic structure decoys are included in training. 

Nonlinear Potential Function for Folding and Design. Because nonUnear potential is expressed 

as a kernel expansion of native protein and decoys, structure decoys and sequence decoys in general lead to 
different protein functions. For example, the contact count vectors c can be very different for a sequence decoy 
of a protein and a structure decoy of the same protein. The potential surface defined by the folding potential 
and the design potential therefore may be different. There are 268 out of 440 native proteins participating in 
folding potential function with a value ranging from 0.01 - 28.04, of which 185 (65%) are between 0.01 and 2.00. 
There are 347 out of 440 native proteins participating in design potential function, with a value ranging from 
0.02 to 110.64, of which 269 (77%) arc between 0.02 and 2.00. Therefore, the majority of the native proteins have 
similar a values for both folding and design potentials. Fig 5 shows the difference Aaj of the coefficient for 
protein i appearing in both folding potential and design potential. For the majority of the native proteins, Aa, 
values are small. That is, most native proteins contribute similarly in design potential and in folding potential. 
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1. 4state_reduced 


Protein 


# of decoys 


KFP 


KDP 


MJ 


TE-13 


Ictf 


631 


1/3.31(0.51) 


1/3.26(0.53) 


1/3.73 


1/4.20 


lr69 


676 


1/3.52(0.58) 


1/4.04(0.62) 


1/4.11 


1/4.06 


lsn3 


661 


3/2.44(0.35) 


6/1.91(0.30) 


2/3.17 


6/2.70 


2cro 


675 


8/2.24(0.56 


1/2.93(0.63) 


1/4.29 


1/3.48 


3icb 


654 


1/2.64(0.71) 


1/2.73(0.69) 






4pti 


688 


1/3.85(0.40) 


1/3.19(0.47) 


3/3.16 


7/2.43 


4rxn 


678 


1/2.71(0.49) 


1/2.29(0.48) 


1/3.09 


16/1.97 


2. Iattice_ssfit 


Protein 


# of decoys 


KFP 


KDP 


MJ 


TE-13 


Ibeo 


2001 


3/3.04 


10/2.61 






Ictf 


2001 


1/4.65 


1/4.73 


1/5.35 


1/6.17 


Idkt 


2001 


2/3.42 


3/3.12 


32/2.41 


2/3.92 


Ifca 


2001 


8/2.89 


21/2.67 


5/3.40 


36/2.25 


Inkl 


2001 


1/3.68 


1/3.81 


1/5.09 


1/4.51 


Ipgb 


2001 


1/4.67 


1/4.40 


3/3.78 


1/4.13 


Itrl 


2001 


32/2.34 


175/1.34 


4/2.91 


1/3.63 


4icb 


2001 


1/4.52 


1/4.62 






3. Imsd 


Protein 


# of decoys 


KFP 


KDP 


MJ 


TE-13 


IbOn-B 


498 


45/1.33 


129/0.61 






Ibba 


501 


6/-3.63 


501/-2.97 






Ictf 


498 


1/3.38 


1/3.48 


1/3.86 


1/4.13 


Idtk 


216 


154/0.63 


178/-1.13 


13/1.71 


5/1.88 


lfc2 


510 


501/-4.00 


499/-2.90 


501/-6.24 


14/2.04 


ligd 


510 


1/4.05 


1/3.85 


1/3.25 


2/3.11 


Ishf-A 


438 


2/2.64 


2/2.44 


11/2.01 


1/4.13 


2cro 


501 


1/2.77 


1/5.13 


1/5.07 


1/3.96 


2ovo 


348 


1/1.21 


67/0.93 


2/3.25 


1/3.62 


4pti 


344 


7/1.57 


47/0.97 







Table 4: Results of discrimination of native structures from decoys using nonlinear kernel potentials. The decoy sets 
include the 4statejreduced set, the Lattice_ssfit set, and the lmsd set [63]. The energy rank of the native 
structure and its z-score are listed. The correlation coefficient R is also listed in parenthesis for the 4state_reduced 
set. KFP stands for kernel folding potential, and KDP stands for kernel design potential. TE-13 potential is linear 
distance based potential optimized by linear programming, taken as reported in [64], and MJ potential is statistical 
potential as reported in [9]. Results for TE-13 potential and Miyazawa-Jernigan potential are taken from Table II of 
[64]. 
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Figure 4: The energy values of decoys and native proteins in the 4state_reduced set by nonlinear design potentials 
and their correlation with the rmsd values to the native structures. 

This is expected, because the main differences between the two potentials are due to differences in decoy sets. 
The native proteins that contribute most differently to folding potential and design potential are small proteins. 
For example, there are 34 native proteins whose |Aa| > 5.0, and they all come from the top 100 smallest protein 
chains. Table 5 listed 20 proteins with highest a values for both kernel folding and design potentials. They are 
all small proteins and there are 15 out of the 20 proteins that appear both in folding and design potentials. It 
is possible that the energy values by kernel folding potential and by kernel design potential may be similar for 
many structure-sequence pairs (s, a). Figure 6a shows that the test 194 proteins have similar energy values by 
the kernel folding and kernel design potentials. 

We also compare the energy value of the potential functions for each of the 210 unit vector c\ ={1,0,..., 0}^, • • •, 
and C210 = {0,...,!}"^. We normalize these values so maxiJ(cj) = 1 for both potentials (Fig 7a). There is 
strong correlation (R = 0.91) for folding and design potentials. 

However, other methods reveal that kernel folding and design potentials are different. One method is to 
compare the energy values of a subset of decoy structures that are challenging. That is, we compare energies of 
decoys with aj 7^ 0. Fig 7b shows that for decoys appearing in the design potentials, there is little correlation in 
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Figure 5: The difference in contribution to the potential function for native protein structures that participate in both 
folding and design potentials. They are sorted by Aa = a^esign ~ "^folding- "^^^ majority of them have Aa close to 0. 

energy values calculated by design potential and by folding potential. Similarly, there is no correlation between 
energy values calculated by folding potential and by design potentials for the set of structure decoys entering 
the design potential function (Fig 7c). It seems that although the values of oats are similar for the majority 
of the native proteins, design potential and folding potential can give very different energy values for some 
conformations. This suggests that the overall fitness for design and folding potential may be somewhat different. 

Comparison with Other Potentials. Potential functions derived from kernel models have the form of 

H{c) = E dev^d ' -^(c, Cd) - Eatg^v"" ■ -^(c, c„), which cannot be directly compared with other potential 
functions of the form H{c) = w ■ c. Here we follow the same approach as above and compare the energy values 
of the unit vectors, the native proteins, and the decoys using different potential functions. The energy values 
by nonlinear folding and design potentials have very little correlation with energy values evaluated using cither 
optimal linear folding potential [20] or statistical potential [9] (Fig 6c-f). It seems nonlinear potentials developed 
in this work contains information absent in other potential functions. 



5 Discussion 

A basic requirement for computational studies of protein folding and protein design is an effective potential 
function, which allows the search and the identification of native structures and native sequences. Current 
empirical potential functions are based on weighted linear sum of pairwise interactions. A major flaw of such 
potentials is that they cannot recognize the native structures of a large number of proteins from alternative 
decoy conformations [20,22]. 

There are several routes towards improving empirical potential functions. One approach is to introduce 
higher order interactions, where three-body or four-body interactions are explicitly incorporated in the potential 
function [50-52,65]. The effectiveness of this approach is likely to be assessed in the future with large scale 
validation studies similar to those carried out for pairwise potentials [8, 20, 22]. A different approach to improve 
empirical potential function is to introduce nonlinear terms. Recently, Fain et al uses sums of Chebyshev 
polynomials upto order 6 for hydrophobic burial and each type of pairwise interactions [27] . 

In this work, we propose a different framework for developing empirical nonlinear protein potential functions. 
Instead of using nonlinear function for each term of pairwise interactions, we use a set of simple Gaussian 
kernel functions located at both native proteins and decoys as the basis set. We regard the decoys as equivalent 
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Kernel Design Potential 


Kernel Folding 


Potential 


Index 


pdb 


a 


Class 


Number 


pdb 


a 


Class 


Number 






value 




of resides 




value 




of resides 


1 


Irop.a.pdb 


28.04 


a 


56 


Irop.a.pdb 


110.64 


a 


56 


2 


Ivie.pdb 


21.43 


b 


60 


Itgs.i.pdb 


91.76 


g 


56 


3 


ligd.pdb 


20.75 


d 


61 


2spc.a.pdb 


56.93 


a 


107 


4 


2spc.a.pdb 


20.12 


a 


107 


2ovo.pdb 


46.34 


g 


56 


5 


labo.a.pdb 


20.08 


b 


58 


lisu.a.pdb 


43.49 


g 


62 


6 


tail. pdb 


19.92 


a 


70 


6ins.e.pdb 


42.31 


g 


50 


7 


Imjc.pdb 


19.75 


b 


69 


tail. pdb 


40.77 


a 


70 


8 


2utg.a.pdb 


18.57 


a 


70 


2utg.a.pdb 


40.54 


a 


70 


9 


Itgs.i.pdb 


17.67 


g 


56 


Ivie.pdb 


35.03 


b 


60 


10 


Itig.pdb 


11.52 


d 


88 


ligd.pdb 


32.58 


d 


61 


11 


lr69.pdb 


11.40 


a 


63 


Itig.pdb 


23.71 


d 


88 


12 


Iten.pdb 


10.74 


b 


90 


Imjc.pdb 


21.99 


b 


69 


13 


Icyo.pdb 


10.55 


d 


88 


labo.a.pdb 


20.86 


b 


58 


14 


451c.pdb 


10.37 


a 


82 


Ifle.i.pdb 


19.94 


g 


47 


15 


lisu.a.pdb 


10.30 


g 


62 


Ihoe.pdb 


18.61 


b 


74 


16 


2ovo.pdb 


10.24 


g 


56 


2wrp.r.pdb 


16.68 


a 


104 


17 


2bop.a.pdb 


9.97 


d 


85 


451c.pdb 


16.29 


a 


82 


18 


1 hoe. pdb 


9.36 


b 


74 


1 blip, pdb 


16.08 


g 


45 


19 


Igvp.pdb 


9.31 


b 


87 


Icmb.a.pdb 


15.63 


a 


104 


20 


Iptf.pdb 


9.19 


d 


87 


Ifxd.pdb 


14.79 


d 


58 



a: All alpha proteins 

b: All beta proteins 

d: Alpha and beta proteins (a+b) 

g: Small proteins 



Table 5: The 20 proteins with highest a value from both kernel folding potential and kernel design potential. The 
residue number, and SCOP class also were listed. 



to the reference state or null model used in statistical potential. The expansion coefScients {aAr}, A/' e M and 
{q!£)}, Z) g X> of these Gaussian kernels determines the protein function. As long as the native proteins and decoys 
are represented as unique vectors c G R'', the Gram matrix of the kernel function is full-rank. Therefore, the 
kernel function effectively maps the protein space into a high dimensional space in which effective discrimination 
with a hyperplane is easier to obtain. The optimization criterion here is not Z-score, rather we search for the 
hyperplane in the transformed high dimensional space with maximal separation distance between the native 
protein vectors and the decoy vectors. This choice of optimality criterion is firmed rooted in a large body of 
studies in statistical learning theory, where expected number of errors in classification of unseen future test data 
is minimized probabilistically by balancing the minimization of the training error (or empirical risk) and the 
control of the capacity of specific types of functions of potential function [37-39] . 

This approach is general and flexible, and can accommodate other protein representation, as long as the final 
descriptor of protein and decoy is a rf-dimensional vector c. In addition, different forms of nonlinear functions 
can be designed using different kernel functions, such as polynomial kernel and sigmoidal kernels. It is also 
possible to adopt different optimality criterion, for example, by minimizing the margin distance expressed in 
1-norm instead of the standard 2-norm Euclidean distance. 

A useful observation obtained from this study is that sequence decoys obtained from gapless threading are 
quite challenging. In fact, we found that no linear potential function exists that can discriminate a training set 
of 440 native sequence from sequence decoys generated by gapless threading. The success of nonlinear potential 
in perfect discrimination of this training set native sequences and its good performance in identifying the native 
sequences in an unrelated test set of 194 proteins indicate that nonlinear kernel potential is a general strategy 
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Figure 6: Comparison of nonlinear kernel folding potential (KFP) and kernel design potentials (KDP) with Tobi-Elber 
(TE) optimal linear potential and Miyazawa-Jernigan (MJ) statistical potential. The energy values of the 194 proteins 
in the test set are calculated and scaled to 0.0 - 1.0. (a). The energy values by KFP and by KDP for the 194 proteins 
are highly correlated. The correlation coefficient is i? = 90. The energy values by (b) MJ and TE are also highly 
correlated. The correlation coefficient is i? = 0.95. The energy values by (c) MJ and KDP, (d) by TE and KDP, (e) 
by MJ and KFP, (f) by TE and KFP are all poorly correlated. This suggests that both the kernel folding and design 
potential functions are different from MJ and TE potentials. 



for developing effective potential function for protc;in scqncncc design. 

It is informative to examine the three misclassified proteins by the kernel folding potential (lbx7, Ihta, and 
3erd). Hirustasin lbx7 contain five disulfide bonds, which are not modeled explicitly by the protein description. 
Ihta (histone Hmfa) exists as a tetramer in complex with DNA under the physiological condition. Its nature 
structure may not be the same as that of a lone chain. The two terminals of this protein are rather flexible, 
and their conformations are not easy to determine. 3erd.a (estrogen receptor a ligand-binding domain) has 
extensive contacts with ligand. These unmodeled interactions may alter protein conformation. Among the 14 
native sequences misclassified by the kernel design potential (la73 . a, IbdS, Ibea, lbm8, IbnS.a, Ibvy.f, Icku.a, 
Idpt.a, Ihta, Imro.c, lops, Iqav.a, lubp.b, Sezm.a), several have extensive interchain interactions, although 
the contact ratio is below the rather arbitrary threshold of 30%: la73.a has Contact Ratio of 23%, Imor.c 
has 24%, lupb.b has 19%, and Iqav.a has 13%. It is possible that the substantial contacts with other chains 
would alter the confirmation of the protein. Amylase inhibitor Ibea contains 5 disulfide bond not explicitly 
modeled. Icku.a (electron transfer protein) contains an iron/sulfur cluster, which covalently bind to four Cys 
residues and prevent them from forming 2 disulfide bonds. These covalnt bonds are not reflected in the protein 
description. Ibvf (oxidoreductase) is complexed with a heme and an FMN group. The conformations of Icku.a 
and Ibvf . f may be different upon removing of these functionally important hetero groups. Altogether, there 
are some rationalization for 10 of the 16 misclassified proteins. 
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Figure 7: Comparison of nonlinear kernel folding potential and kernel design potential generated by optimized 
discrimination against decoys from gapless threading, (a) . The energy values of the nonlinear folding and design 
potentials for the 210 unit vectors are strongly correlated {R = 0.91). (b). The energy values by both design 
potentials and by folding potential for decoys that enter the nonlinear design functions are poorly correlated, (c). 
The energy values for decoys that enter the nonlinear folding functions are also poorly correlated. 



Our goal in this study is to explore an alternative formulation of potential function and assess the effectiveness 
of this new approach with experimental data. The nonlinear potential functions obtained in this study should be 
further improved. For example, unlike the study of optimal linear potential [20] , where explicitly generated three- 
dimensional decoys structures are used in training, we used only structure decoys generated by threading. The 
test results using the 4sTATE_REDUCED set and the LATTICE_SSFIT are comparable or better with other residue- 
based potential (see Fig 4 and Table 4). It is likely that further incorporation of explicit three-dimensional decoy 
structures in the training set would improve the protein potential. 

In summary, we show in this study an alternative formulation of protein function using a mixture of Gaussian 
kernels. We demonstrate that this formulation can lead to effective folding potential and design potential that 
perform well in independent tests. For protein sequence design where challenging decoys are available from 
gapless threading, nonlinear kernel potential can have perfect classification in the training set of 440 proteins, 
while linear potential and statistical potential failed. It also performs bettor in an independent test set of 194 
proteins and reduce the misclassification to 40% of that of optimal linear potential. Our results suggest that more 
sophisticated functional form other than the simple weighted sum of contact pairs can be useful for studying 
protein folding and protein design. This approach can be generalized for any other protein representation, e.g., 
with descriptors for explicit hydrogen bond and higher order interactions. 
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7 Appendix 

Lemma 1 For a potential function in the form of weighted linear sum of interactions, a decoy always has energy 
values higher than the native structure by at least an amount ofb>0, i.e., 

w-{cD-CN)>b for all {{cd - cn)\D G T> and N G Af} (6) 

if and only if the origin is not contained within the convex hull of the set of points {{cd — cn)\D e T> and N e 
AT}. 

Proof: Suppose that the origin is contained within the convex hull A = conv({c£) — cjv}) of {cd — c^} and 
Equation (6) holds. By the definition of convexity, any point inside or on the convex hull A can be expressed as 
convex combination of points on the convex hull. Specifically, we have: 

0= ^ Xco-CN ■ {cd - Cn), and ^Aco-cjv = li'^CD-cjv > 0. 

(cd—cn)&A 

That is, we have the following contradiction: 

= w O = W- ^ AcD -CN • (cd - Cjv) = ^ X(^cd,cn) ■ ■ i'^D - C^) > ^ Xc^-cn ' ^ = b- 
cd—cn CD— Cm CD— Cm 

Because the convex hull can be defined as the intersection of half hyperplanes derived from the inequalities, 
if a half hyperplanc; has a distance & > to the origin, all points contained within the convex hull will be on the 
other side of the hyperplane [43]. Therefore, w ■ {cd — Cn) > b will hold for all {{cd — cjv)}- ■ 
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